Deep Dive | Experimentation for leaders

What is this document?

We hosted a deep dive on experimentation with Gautham Krishnan (Product Leader, Disney+ Hotstar) and Pramod N (VP, Product & Data Science, Rapido), moderated by Saket Toshniwal (Sr. Director, MoEngage), where we covered the nuances of how to run an experiment end to end.

This was a super actionable and interesting conversation, and this document is a summary of it.


Before we begin, a little about our guests:

Gautham Krishnan (Product Leader, Disney+ Hotstar)

Gautham is currently a product leader at Disney+ Hotstar. He has previously worked in product leadership roles at companies like Gameskraft, Snapdeal and Honestbee. Apart from this, he’s also part of the advisory council for ISB’s PGP programme.

Pramod N (VP Product & Data Science, Rapido)

Pramod is currently the VP of Product & Data Science at Rapido, where he’s responsible for product-led growth and for bringing efficiencies to growth across the customer, captain and marketplace levers. Before this, he was a Lead Consultant at Thoughtworks.

Saket Toshniwal (Sr. Director, MoEngage)

Saket has over 13 years of experience working in product & growth roles. He’s currently a Senior Director at MoEngage. He’s been a GrowthX member for the last 6 months, and many of you may already have interacted with him at different events.

We covered 4 main topics in the session:

  • Litmus test on when to experiment
  • What to do pre, during, and post the experiment
  • How to analyse the results
  • How to manage stakeholders

In this document, we cover the questions asked and a summary of the answers given by the panelists.

Litmus test on when to experiment

There are three factors to consider in a litmus test for experimentation:

  1. Organisational and cultural alignment
  2. Degree of uncertainty
  3. Cost vs benefit

Organisational and cultural alignment

Experiments are not standalone events; they depend on an ecosystem where teams believe in data-driven decision-making. Without alignment, even the best-designed experiments fail to create impact. Stakeholders may question the validity of data or reject findings that contradict their beliefs.

Gautham pointed out, "You can do a bunch of experiments, you can prove the data, you can put it in front of people, but if they don’t accept it, there is no point in conducting that experiment, right?"
Pramod added, "The biggest barrier that you will often face is a person who knows everything. I mean, it’s not even the fact of knowing everything, it’s just that they think they know everything."

This is the cultural resistance that can block experimentation. It stems from a lack of alignment or a reliance on intuition rather than data.

Key steps for alignment here:

    1. Educate teams about the benefits of experimentation.
    2. Align leadership to prioritize learning over ego.
    3. Ensure there is a shared understanding of metrics, so all teams are evaluating results on the same baseline.

Key steps on how to overcome resistance:

    1. Cultural Shifts: Begin with small, low-risk experiments to show the value of iterative learning.
    2. Top-Down Support: Leaders must advocate for experimentation and model data-driven decision-making.
    3. Transparency in Results: Sharing outcomes openly fosters trust and breaks down skepticism.

Degree of uncertainty

Experimentation provides clarity in situations where the outcome is unpredictable. This could include user behaviour, market reactions, or operational changes. There are two types of problem statements: Tier 1 problems and Tier 2 problems.

Tier one decisions:

These are irreversible, high-stakes changes. Gautham explained:

“Once you increase prices, you cannot go back. There might be a big consumer backlash.”

Examples: Overhauling the payment system, launching a new pricing strategy, or changing core app navigation.

Experimentation ensures risks are mitigated before full-scale deployment. For instance, if a new payment system is tested, the company can understand user drop-off rates and failure points before global rollout.

Tier two decisions:

These are low-risk, reversible changes that don’t require extensive validation.

“These are revolving doors where... you can pull back the decisions you’ve taken.”

Examples: Introducing a new banner layout, tweaking button colors, or making minor adjustments to existing flows.

These can be rolled out directly with mechanisms for rollback if needed.

Cost vs. benefit

Not all problems merit experimentation. If the cost (time, resources, risk) of running an experiment outweighs the value of the insights, direct action may be a better choice.

If it’s more expensive to run an experiment than to just do it, then don’t bother running an experiment.

Examples:

  • Testing small UI changes may not be worth it if they don’t significantly impact core metrics like engagement or retention.
  • Large-scale changes (e.g., altering user onboarding flows) often justify experimentation due to their potential to impact KPIs.

Balancing trade-offs:

Leaders must decide when the learning potential of an experiment outweighs its execution costs. For instance:

  • A/B testing pricing plans is worth the cost because of the long-term revenue implications.
  • Tweaking the order of elements on a non-critical page may not be worth the effort.

Pramod and Gautham shared examples of navigating tier one problem statements.

Disney+ Hotstar was revamping the entire UI/UX of its app. This revamp was not just about cosmetic changes; it fundamentally altered how users interacted with the platform, impacting critical metrics and user behaviours.

Why it was difficult to experiment:
The scale and scope of the changes made traditional A/B testing impractical. Gautham explained:

“When we revamped the entire app UI/UX, all the data changed—the clickstream, user navigation patterns, and how content was consumed. These changes are so significant that they couldn’t be A/B tested effectively.”

The changes were considered a tier one decision because they had broad implications for user engagement and content consumption, making it impossible to roll back completely once implemented.

How they approached it:
Instead of traditional A/B testing, they employed a graceful rollout strategy:

    • Limited deployment: The changes were released to a subset of users in a controlled environment.
    • Metric monitoring: They focused on the North Star metric, which was "watch time per video viewer," to assess engagement.
    • Iterative adjustments: Bugs and UX issues identified during the rollout were fixed incrementally.

Outcome considerations:
This strategy ensured they could capture user feedback and tweak elements before a full-scale rollout. However, such decisions required alignment across teams, as the impact on user behaviour and business metrics was substantial.


Pramod discussed the major decision at Rapido to shift from a commission-based model to a subscription-based model. This was another tier one decision, as it involved a fundamental change in the company’s business structure with long-term implications.

Why it was challenging:

“This is one of those tier one decisions. It has insane risk; it changes you for the next 50 years.”
A wrong move could alienate key stakeholders (e.g., drivers and users), reduce trust, and negatively impact revenue streams.

How did they approach it:

    1. Breaking down the problem:
      The team decomposed the decision into smaller, testable behavioural assumptions. For example:
      • Would drivers accept paying a subscription fee instead of the traditional commission?
      • How would subscription fees impact retention and performance?
    2. Behavioural experiments:
      They ran proxy experiments to test these assumptions:
      • Offering subscription-like incentives to drivers and analysing their responses.
      • Observing changes in driver behaviour under these test conditions.
    3. Time investment:
      Pramod explained: “We spent three months validating behavioural insights to support this strategic shift.”

Outcome Considerations:
These smaller experiments provided critical data that helped mitigate risks and informed the final decision. While the shift itself wasn’t directly tested at scale, the insights gathered minimised uncertainties.

How to form your hypothesis?

A well-crafted hypothesis one-pager is important to guide an experiment effectively. It ensures clarity, aligns stakeholders, and sets the stage for meaningful insights. This is the first and the most important step while conducting any experiment.

Problem statement:
  • Define the exact issue or opportunity the experiment aims to address.
  • Ensure it’s specific, measurable, and directly tied to the product or business goals.
  • Example: At Disney+ Hotstar, a hypothesis for push notifications might start with: “We hypothesise that optimising transactional push notifications will increase the likelihood of completing a transaction.”

Objective and hypothesis:
  • Clearly articulate the hypothesis: a falsifiable statement predicting the outcome of the intervention.
  • Example format: “If [specific change] is implemented, then [expected measurable outcome] will occur because [rationale].”
  • Example:
    For engagement metrics, the hypothesis might be: “If we personalise homepage tiles based on user preferences, then click-through rates and watch time will increase because users will find relevant content more quickly.”

North star metric:
  • Identify the primary metric that will determine success or failure.
  • This should align with the product’s or organisation’s strategic goals.
  • Gautham’s example: “Push notifications often use CTR as the metric, but that’s a leading indicator. The real North Star metric should measure the ultimate goal, such as transaction completion or user engagement.”

Guardrail metrics:
  • Include metrics to monitor unintended consequences or trade-offs. For example:
    • Does an increase in one metric cause a decline in others?
    • Does a feature benefit a subset of users at the cost of others?
  • Gautham’s example: When optimising content recommendations, guardrail metrics like overall platform engagement are monitored to avoid cannibalising other categories.

Sample and cohort definitions:
  • Define the target audience for the experiment (e.g., new users, returning users, or a specific segment).
  • Ensure the homogeneity of the test and control groups to avoid skewed results.
  • Key Insight:
    Gautham stressed the need for representative samples: “One of the main reasons experiments fail is because the population isn’t homogeneous. The control and test groups should reflect the wider user base.”

Experiment design and execution plan:
  • Detail the methodology, including:
    • Intervention details (e.g., feature rollout, UI change).
    • Duration of the experiment (based on expected statistical significance; see the sample-size sketch after this list).
    • Tools or platforms for tracking metrics.
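The sample-size sketch referenced above: a minimal example in Python using statsmodels to estimate how many users per group (and roughly how many days) an A/B test needs. The baseline rate, the minimum lift worth detecting, and the daily traffic figure are hypothetical placeholders, not values from the session.

```python
# A minimal sample-size/duration sketch; all input numbers are hypothetical.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20        # assumed current conversion rate
target_rate = 0.22          # smallest lift worth detecting

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,             # significance level agreed up front
    power=0.8,              # 80% chance of detecting the lift if it exists
    alternative="two-sided",
)

daily_users_per_variant = 50_000   # hypothetical eligible traffic per variant
days_needed = n_per_group / daily_users_per_variant
print(f"~{n_per_group:,.0f} users per group, so roughly {days_needed:.1f} days")
```

Whatever tool you use, agree on the minimum detectable effect before the experiment starts; deciding it afterwards invites cherry-picking.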

Risks and mitigation:
  • List potential risks (e.g., technical issues, user backlash, or misinterpretation of results).
  • Include strategies for mitigating these risks.

Expected outcomes and impact:
  • Define what success looks like for this experiment and how it ties to broader goals.
  • Include a hypothesis-to-outcome mapping, such as:
    • Hypothesis: Personalised homepage tiles will increase engagement.
    • Expected Outcome: A 10% increase in watch time per user.

While forming a hypothesis, you should also analyse the risks:

  1. Feature/Product Value Risk: Does the feature solve a meaningful user problem?
    • Evaluate whether the hypothesis addresses a core user need or is a vanity feature.
    • Use frameworks like RICE (Reach, Impact, Confidence, Effort) to prioritise; a scoring sketch follows this list.
  2. Usability Risk: Will users understand and effectively use the feature?
    • Test UI/UX elements in prototypes or small cohorts to minimise this risk.
  3. Execution Risk: Can the intervention be built and delivered within constraints?
    • Ensure the team has the bandwidth and technical capability to execute the hypothesis.
  4. Business Value Risk: Does the feature align with business goals?
    • Ensure alignment with key stakeholders to avoid misaligned priorities.
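The RICE scoring sketch referenced above: the score is simply (Reach × Impact × Confidence) ÷ Effort, and higher scores get prioritised. The candidate hypotheses and all input values below are hypothetical.

```python
# A minimal RICE prioritisation sketch; all inputs are hypothetical.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    reach: int         # users affected per quarter
    impact: float      # e.g. 0.25 = minimal, 1 = medium, 2 = high, 3 = massive
    confidence: float  # 0..1
    effort: float      # person-months

    @property
    def rice(self) -> float:
        return (self.reach * self.impact * self.confidence) / self.effort

candidates = [
    Hypothesis("Personalised homepage tiles", reach=500_000, impact=2, confidence=0.7, effort=4),
    Hypothesis("New banner layout", reach=200_000, impact=0.5, confidence=0.9, effort=1),
]

for h in sorted(candidates, key=lambda h: h.rice, reverse=True):
    print(f"{h.name}: RICE = {h.rice:,.0f}")
```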

To ensure clarity and focus, follow these tips:

  • Keep It Falsifiable: A good hypothesis is testable and can be proven right or wrong.
    Example: “If we improve onboarding instructions, drop-off rates during onboarding will decrease by 15%.”
  • Focus on Causality: Clearly link the intervention to the expected outcome.
    Example: “If we reduce delivery fees, customer retention will increase because lower costs improve perceived value.”
  • Use Data: Base hypotheses on existing behavioral data or research insights.

When should you start experimenting?

Based on user base

There is no strict rule that experimentation requires a minimum number of Monthly Active Users (MAUs) or Daily Active Users (DAUs). Instead, it depends on what you are trying to validate and the type of risks you’re addressing. Pramod explained:

“For usability risks or user-related risks—like whether users will adopt a new feature—you can test even without a large active user base. You don’t even need tech to test these risks.”
  • Example: Early-stage startups can conduct offline experiments such as user interviews, fake door tests, or prototypes to validate behaviours before investing in product changes.

Based on the stage of company

Early-stage startups:
“You can test without a large DAU or MAU. If you understand the principles of experimentation, you can start even with just 100 users.”

  • Experimentation at this stage is often lightweight and scrappy. You can use tools like surveys, prototype testing, and qualitative interviews to validate ideas.
  • Example: Pramod mentioned fake door experiments, where you present users with a hypothetical feature or offer to gauge interest without building it.

Growth-stage companies:

  • At this stage, experimentation can become more data-driven as you have a larger user base and more resources. Here, you focus on:
    • Optimising user experiences.
    • Scaling high-impact features.
    • Addressing specific behavioural changes through data-driven hypotheses.

Mature companies:

  • Experimentation focuses on improving efficiency, driving retention, and optimizing for long-term engagement.
  • Example: Gautham’s experiments at Disney+ Hotstar focused on refining push notification strategies and testing UI changes for better watch time.

How to set up a metrics dashboard for the experiment?

Context and objective

The first step is to provide the analytics team with a clear understanding of why the dashboard is needed and what it aims to achieve.

Gautham’s Approach:

“The definition of the metric changes from team to team... The first step is ensuring everyone is aligned on what the metric means and how it’s calculated.”

  • The brief should start by outlining the experiment or decision the dashboard supports.
  • Define the primary goal: Is the dashboard meant to track engagement, measure experiment outcomes, or monitor platform health?

Key metrics: North star and guardrail metrics

Clearly define the North Star metric (primary metric of success) and guardrail metrics (to track unintended side effects).

Gautham’s Insight:

“Push notifications often focus on CTR as the metric, but that’s a leading indicator. The real North Star metric should be the transaction completion rate or the time spent engaging with content.”

  • What to Include in the Brief:
    • Primary Metric (North Star): Clearly define the metric most aligned with the organisation’s goals.
      Example: For subscription experiments, track retention rate or lifetime value (LTV).
    • Guardrail Metrics: Include secondary metrics to catch potential trade-offs.
      Example: Monitor churn rates to ensure they don’t increase due to price changes.

Homogeneity test

Analytics dashboards should account for homogeneity to ensure valid comparisons between control and test groups.

Gautham’s Point:

“One of the main reasons experiments fail is because the population isn’t homogeneous... The control and test groups must be representative of the wider user base.”

What to Include in the Brief:

  • Highlight the importance of checking for homogeneity before and during the experiment.
  • Specify cohorts to track for representativeness (e.g., age groups, locations, app usage behavior).
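One way to back a homogeneity panel on the dashboard is a simple balance check on how users are distributed across the split. The sketch below uses a chi-square test on hypothetical platform counts; a statistically significant imbalance is a flag to investigate the split before trusting downstream metrics.

```python
# A minimal balance check; the platform counts per group are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

counts = pd.DataFrame(
    {"control": [42_000, 31_000, 9_000], "test": [41_500, 31_800, 8_700]},
    index=["android", "ios", "web"],
)

chi2, p_value, dof, _expected = chi2_contingency(counts.values)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Platform mix differs between control and test: investigate the split.")
else:
    print("No evidence of imbalance on platform mix.")
```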


Data dictionary

Include a data dictionary that lists all events, variables, and metrics used in the dashboard.

Gautham’s Take:

“A data dictionary ensures that variables or events are consistently understood across teams, avoiding confusion about what’s being tracked.”

Example Brief Section:

Include definitions for key events like:

  • “Subscription Event”: User clicks ‘Subscribe’ and completes payment.
  • “Churn Event”: User cancels within 30 days.
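In practice, the data dictionary can live as a small, version-controlled artifact next to the dashboard code so every team reads from the same definitions. A minimal sketch; the event names, source tables and windows below are illustrative, not from the session.

```python
# A minimal data-dictionary sketch; names, tables and windows are hypothetical.
DATA_DICTIONARY = {
    "subscription_event": {
        "definition": "User clicks 'Subscribe' and completes payment",
        "source_table": "payments.transactions",        # hypothetical table
        "grain": "one row per successful payment",
    },
    "churn_event": {
        "definition": "User cancels within 30 days of subscribing",
        "source_table": "subscriptions.cancellations",  # hypothetical table
        "window_days": 30,
    },
    "watch_time_per_viewer": {
        "definition": "Total minutes watched / distinct video viewers, per day",
        "source_table": "analytics.playback_sessions",  # hypothetical table
        "aggregation": "daily",
    },
}
```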

How do you manage stakeholders for an experiment?

Establish pre-alignment

Stakeholders are more likely to support an experiment if they clearly understand its purpose, expected outcomes, and how success will be measured. Pre-alignment ensures there’s no confusion or conflict once the experiment process has begun.

Gautham's perspective:

“If there is no alignment, and if there is no culture of experimentation, there’s no point in running that experiment.”
Experiments should only proceed when all stakeholders agree on the problem, hypothesis, and metrics. Misaligned expectations often lead to resistance or dismissal of the results.

How to Do It:

  • Schedule kickoff meetings with key stakeholders to define the experiment’s scope and objectives.
  • Use clear, jargon-free language to explain why the experiment is necessary.
  • Agree on success metrics (e.g., North Star metric) and guardrails to monitor unintended consequences.

Example:
Before launching a pricing experiment, the team aligns on the primary objective (e.g., improving revenue), success metric (e.g., revenue per user), and guardrails (e.g., churn rate shouldn’t increase beyond 5%).

Build cultural alignment around experimentation

A lack of experimentation culture can lead to skepticism, resistance, or outright rejection of findings. Experiments can be perceived as time-wasting or threatening, especially if they challenge long-standing beliefs.

Gautham’s Insight:

“We need to be dispassionate about the outcome. It’s not about whose idea it is, but whether the idea works.”
He emphasised creating a culture where data, not egos, drive decisions. This requires normalising failure as part of the learning process.

Pramod’s Insight:

“The biggest barrier is often people who think they know everything... It requires somebody to accept that there is a scientific way of learning things.”
At Rapido, he worked to establish a culture where experiments were seen as tools for discovery rather than tools to prove someone wrong.

How to Do It:

  • Educate stakeholders: Run workshops or presentations to demonstrate how experiments reduce risk and improve outcomes. Share success stories from within or outside the organization.
  • Normalise failure: Position failed experiments as learning opportunities. Publicly highlight what was learned rather than focusing on who proposed the idea.

Example:
At Disney+ Hotstar, Gautham emphasised the importance of framing results objectively, ensuring that no individual or team was blamed for an unsuccessful experiment.

Align experiments with broader business goals

Stakeholders are more likely to support an experiment if they see how it contributes to the organization’s strategic objectives, such as revenue growth, user retention, or operational efficiency.

Gautham’s Insight:

“Stakeholders should see that experiments contribute to larger business goals, such as retention, revenue, or engagement.”
Experiments must be framed in the context of how they advance these goals.

Pramod’s Insight:

“If you’re dealing with significant unknowns, the cost of inaction is often higher than the cost of running an experiment.”
He used this argument to justify experimentation for high-stakes decisions, such as transitioning Rapido to a subscription model.

How to Do It:

  • Map each experiment to a broader business outcome.
  • Show how the experiment addresses uncertainty and reduces risk.
  • Use data to highlight potential downsides of not experimenting.

Example:
For a new subscription plan, frame the experiment as a way to increase driver retention and stabilize revenue, directly linking it to organizational growth goals.

What to do post-experiment?

Revalidate sampling and homogeneity

Before interpreting the results, ensure that the sampling process was accurate and that the test and control groups were homogeneous. A failure in sampling can invalidate the conclusions.
Gautham’s Insight:

“After the experiment, we still check for homogeneity to ensure the control and test groups are comparable.”

How to do it:

  • Compare the demographic and behavioural characteristics of the test and control groups to confirm they are representative of the broader population.
  • Use statistical tests like t-tests or ANOVA to check for significant differences between groups.

Example:
For a new subscription feature, verify that control and test groups had similar engagement levels and subscription behaviour before the intervention.
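As a concrete illustration, here is a minimal sketch of that check using Welch’s t-test on a pre-experiment engagement measure. The CSV file and column names are hypothetical stand-ins for whatever your analytics export looks like.

```python
# A minimal homogeneity re-check; the CSV and its columns are hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("experiment_users.csv")   # columns: user_id, group, pre_sessions

control = df.loc[df["group"] == "control", "pre_sessions"]
test = df.loc[df["group"] == "test", "pre_sessions"]

t_stat, p_value = ttest_ind(control, test, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value means the groups differed *before* the intervention,
# which undermines any comparison of post-experiment metrics.
```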


Analyse leading metrics

Leading metrics (e.g., click-through rates, interaction frequency) provide an early indication of whether the intervention drove the expected behaviour.
Pramod’s Insight:

“Before diving into lagging metrics, check if the leading metrics moved as expected. If they didn’t, the experiment might not have had the intended impact.”

What to Look For:

  • Did users interact with the tested feature?
  • Was there an improvement in short-term metrics such as clicks or session time?

Example:
If a push notification experiment aimed to increase engagement, leading metrics like click-through rate or notification open rate should show positive changes.


Measure lagging metrics

Lagging metrics, such as retention, revenue, or lifetime value (LTV), provide a more comprehensive picture of the experiment’s long-term impact.
Gautham’s Insight:

“Click-through rates are a leading metric, but the real goal is whether the North Star metric—like transaction completion or watch time—increased.”

What to Look For:

  • How did the experiment affect your North Star metric?
  • Did the changes translate into meaningful outcomes, such as improved retention or revenue?

Example:
For a subscription experiment, measure whether users who received a personalized offer showed a higher retention rate over 30 days compared to the control group.


Assess guardrail metrics

Guardrail metrics help you identify unintended consequences of the intervention.
Gautham’s Insight:

“When optimising content, we realised that increasing click-through rates on tiles could cannibalise overall engagement by shifting attention away from long-format shows.”

What to Look For:

  • Did any metric deteriorate unexpectedly?
  • Were there trade-offs between metrics?

Example:
For a UI revamp, guardrail metrics like app crash rates, session lengths, or opt-out rates for notifications can highlight negative impacts.
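A lightweight way to operationalise this is to list the agreed guardrails with their tolerances and check them in one pass. A minimal sketch; the metric names, values and thresholds below are hypothetical.

```python
# A minimal guardrail check; metrics, values and tolerances are hypothetical.
GUARDRAILS = {
    # metric: (control value, test value, max tolerated relative change)
    "crash_rate":           (0.012, 0.013, 0.10),
    "notification_opt_out": (0.045, 0.052, 0.05),
    "avg_session_minutes":  (31.0, 30.4, 0.03),
}

for metric, (control, test, tolerance) in GUARDRAILS.items():
    # Kept simple here: flag any relative change larger than the tolerance,
    # in either direction, and review the direction manually.
    relative_change = abs(test - control) / control
    status = "BREACH" if relative_change > tolerance else "ok"
    print(f"{metric}: {relative_change:.1%} change vs. control -> {status}")
```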


Validate statistical significance

Statistical significance ensures that observed differences are unlikely to have occurred by chance.

Pramod’s Insight:

“Human behaviour is unpredictable. Before concluding, check whether the changes are significant and not just within the range of natural variability.”

How to Do It:

  • Use statistical methods such as p-values, confidence intervals, or z-tests to validate the results.
  • Compare the standard deviations of test and control groups to assess variability.

Example:
For an experiment showing a 2% increase in retention, ensure that this change exceeds the confidence interval and is statistically significant.
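For a retention lift like the one above, a two-proportion z-test plus a confidence interval on the difference is one reasonable check. A minimal sketch using statsmodels; the retained and exposed counts are hypothetical.

```python
# A minimal significance check; retained/exposed counts are hypothetical.
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    proportions_ztest,
)

retained = [4_300, 4_100]     # retained users in test, control
exposed = [20_000, 20_000]    # users in each group

z_stat, p_value = proportions_ztest(count=retained, nobs=exposed)
low, high = confint_proportions_2indep(
    retained[0], exposed[0], retained[1], exposed[1]
)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
print(f"95% CI for the retention lift: [{low:.2%}, {high:.2%}]")
# Act on the result only if p clears your threshold, the interval excludes
# zero, and the lift beats the minimum effect agreed up front.
```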


Identify and investigate anomalies

Unexpected results may indicate user segmentation issues, technical errors, or other confounding variables.
Pramod’s Example:

“If fulfillment improved by 1.5%, the next question is whether that change is meaningful or just a result of a sampling bias.”

How to Investigate:

  • Look for outliers in the data.
  • Revisit logs to check for technical glitches during the experiment.
  • Break down metrics by segment (e.g., geography, age, or usage frequency) to identify patterns.

Example:
If a pricing experiment led to higher revenue in one region but lower revenue elsewhere, investigate regional preferences or economic factors.
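A quick way to spot this kind of anomaly is to break the topline metric down by segment and look for lifts that flip sign. A minimal sketch in pandas; the results export and its columns are hypothetical.

```python
# A minimal segment breakdown; the CSV and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("experiment_results.csv")  # user_id, group, region, revenue

# Mean revenue per user, per region and per group
pivot = df.pivot_table(index="region", columns="group", values="revenue", aggfunc="mean")
pivot["lift"] = pivot["test"] - pivot["control"]
print(pivot.sort_values("lift"))
# A segment whose lift has the opposite sign to the topline is where to dig:
# sampling bias, a broken rollout, or a genuine regional difference.
```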


Compare results with predefined hypotheses

Hypotheses provide the framework for evaluating success and learning from the experiment.
Gautham’s Insight:

“Experiments should test hypotheses directly. If the hypothesis was poorly formed, you might get results that don’t align with your goals.”

How to Do It:

  • Revisit the hypothesis one-pager to confirm if the results match the predicted outcomes.
  • Evaluate whether the observed changes align with the rationale behind the hypothesis.

Example:
If a hypothesis predicted that reducing onboarding steps would lower drop-off rates by 20%, check if the drop-off decreased by at least that amount.


Use holdouts and reverse A/B testing

Holdouts and reverse A/B testing provide additional validation by comparing users exposed to the change with those who were not, even after the experiment ends.
Pramod’s Insight:

“Reverse A/B testing helps you understand whether users revert to old behaviors when the feature is removed.”

How to Do It:

  • Retain a small holdout group to monitor long-term behavior without the intervention.
  • Conduct reverse A/B testing by removing the change and observing its impact on behavior.

Example:
For a loyalty program, check whether users who lose access to benefits show a drop in engagement compared to those still enrolled.
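A common way to keep a stable holdout is deterministic bucketing on a hashed user ID, so the same users stay held out across sessions and releases. A minimal sketch; the salt and the 5% holdout size are illustrative choices, not from the session.

```python
# A minimal holdout-assignment sketch; salt and holdout size are hypothetical.
import hashlib

HOLDOUT_FRACTION = 0.05
SALT = "loyalty_program_2024"   # hypothetical experiment salt

def in_holdout(user_id: str) -> bool:
    """Return True if this user should stay outside the new feature."""
    digest = hashlib.sha256(f"{SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < HOLDOUT_FRACTION * 10_000

print(in_holdout("user_12345"))  # stable answer every time it is called
```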


Document insights

Proper documentation ensures that findings are accessible for future experiments and decision-making.
Gautham’s Insight:

“Documenting learnings helps improve the roadmap and ensures that everyone—product, analytics, design—benefits from the experiment. We socialize results widely so that everyone—engineering, design, marketing—understands the learnings and uses them to make better decisions.”

What to Document:

  • Objectives, methodology, and outcomes.
  • Key learnings, including both successes and failures.
  • Recommendations for future experiments.

Example:
For a UI experiment, document how changes improved the North Star metric and note any technical challenges faced during implementation.